Asymptotic convergence rates for coordinate descent
in polyhedral sets


Olivier Bilenne 


Abstract We consider a family of parallel methods for constrained optimization
based on projected gradient descents along individual coordinate directions.
In the case of polyhedral feasible sets, local convergence towards a regular
solution occurs unconstrained in a reduced space, allowing for the computation
of tight asymptotic convergence rates by sensitivity analysis, even when
global convergence rates are unavailable or too conservative. We derive linear
asymptotic rates of convergence in polyhedra for variants of the coordinate
descent approach, including cyclic, synchronous, and random modes of
implementation. Our results find application in stochastic optimization, and with
recently proposed optimization algorithms based on Taylor approximations of
the Newton step.


1 Introduction 

The interest of coordinate descent methods lies in their simplicity of
implementation and flexibility. Yet their performance in terms of speed of
convergence is generally modest compared to that of their centralized counterparts
and is still subject to active research. In this work we derive the asymptotic
convergence rates of parallel implementations of the gradient projection
algorithm [3] in the context of the constrained minimization of a strictly convex,
continuously differentiable function over a polyhedral feasible set; this class
of problems is met for instance in bound-constrained optimization or in dual
optimization. Our developments rely on the property of projected gradient
methods to asymptotically behave, when applied in a polyhedral feasible set
specified by a collection of affine inequality constraints, like unconstrained
gradient descents on the surface of the polyhedron, provided that the gradient of
the cost function at the point of convergence is a negative combination of the
normal vectors of the active constraints. This property facilitates the
derivation of rates of convergence in the form of matrices playing roles analogous
to those of system matrices in linear state space models, thus reducing the
question of the convergence to the spectral analysis of matrices.

Outline. Section 2 formulates the gradient projection algorithm and
identifies certain properties enjoyed by the method in polyhedral sets. From the
initial algorithm we derive parallelized implementations operating gradient
descents along coordinate directions, each of them characterized by the way
these operations are organized (synchronously, cyclically, randomly, etc.). In
Section 3 we compute asymptotic convergence rates for the parallel algorithms
under the hypothesis of twice continuous differentiability at the point of
convergence. Our developments are then reconsidered for non-twice differentiable
cost functions and from the perspective of stochastic optimization settings.

Notation. In this paper vectors are column vectors and denoted by $x =
(x_1, \dots, x_n)$, where $x_1, \dots, x_n$ are the coordinates of $x$. Subscripts are reserved
for vector coordinates. The transpose of a vector $x \in \mathbb{R}^m$ is denoted by $x'$ and
its Euclidean norm by $\|x\|$; for any $M \in \mathbb{R}^{m \times m}$ symmetric, positive definite,
we define the scaled norm $\|x\|_M := (x' M x)^{1/2}$. Let $S$ be a finite set, $\{a^k\}$ a
sequence in $S$, and $a \in S$. We write $a^k \to a$ iff there is a $\bar{k}$ such that $a^k = a$
for $k \geq \bar{k}$. Similarly, for $A \subset S$ and a sequence $\{A^k\}$ of subsets of $S$, we
write $A^k \to A$ iff there is a $\bar{k}$ such that $A^k = A$ for $k \geq \bar{k}$.


2 The gradient projection algorithm 

2.1 Formulation 

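Consider a closed convex subset $X$ of a real vector space $\mathbb{R}^m$ and a function
$f \in \mathcal{F}(m)$, where, for any space $\mathbb{R}^p$, $\mathcal{F}(p)$ denotes the set of the functions
$\mathbb{R}^p \to \mathbb{R}$ that are strictly convex and continuously differentiable with Lipschitz
continuous gradient $\nabla f$. Lipschitz continuity of $\nabla f$ can be understood as the existence of a
symmetric, positive definite matrix $L \in \mathbb{R}^{p \times p}$ satisfying
$[\nabla f(x) - \nabla f(y)]'(x - y) \leq \|x - y\|_L^2$ for any $x, y \in \mathbb{R}^p$. It follows from this condition that¹

    $\nabla f(x)'(y - x) \geq f(y) - f(x) - \tfrac{1}{2}\|x - y\|_L^2, \quad \forall x, y \in \mathbb{R}^p.$    (1)

Let $\underline{\lambda}$ and $\bar{\lambda}$ be two positive scalar constants such that $0 < \underline{\lambda} \leq \bar{\lambda} < \infty$.
For any real space $\mathbb{R}^p$, we let $\mathcal{T}(p)$ denote the set of the symmetric, positive
definite scaling matrices in $\mathbb{R}^{p \times p}$ with eigenvalues bounded by $\underline{\lambda}$ and $\bar{\lambda}$, i.e.
$\mathcal{T}(p) = \{T \in \mathbb{R}^{p \times p} : \underline{\lambda} I \preceq T \preceq \bar{\lambda} I\}$. We consider the following algorithm.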

Algorithm 1 (Scaled gradient projection). Consider a closed,
convex set $X \subset \mathbb{R}^p$, a function $f \in \mathcal{F}(p)$, a scaling mapping $T : \mathcal{F}(p) \times X \to \mathcal{T}(p)$,
fixed scalar parameters $\beta, \sigma \in (0,1)$, and an initial point $x^0 \in X$. A
scaled gradient projection algorithm is given by

    $x^{k+1} = \mathcal{G}(f, x^k), \quad k = 0, 1, 2, \dots,$    (2)

with $\mathcal{G}$ defined for $x \in X$ by $\mathcal{G}(f, x) := x(a)$, where

    $x(a) \in \arg\min_{y \in X} \; \nabla f(x)'(y - x) + \tfrac{1}{2a}\|y - x\|_{T(f,x)^{-1}}^2, \quad \forall a > 0,$    (3)

and $a$ is an appropriate step size bounded away from 0.

¹ To show (1), use for instance $f(y) = f(x) + \nabla f(x)'(y - x) + \int_0^1 [\nabla f(x + \xi(y - x)) - \nabla f(x)]'(y - x)\, d\xi$.

Any point $x \in X$ such that $\mathcal{G}(f, x) = x$ is called stationary. Since by
assumption $f$ is convex, the stationary points coincide with the solutions of
the minimization of $f$. The first-order optimality condition of a point $x \in X$
is therefore given by

    $\nabla f(x)'(y - x) \geq 0, \quad \forall y \in X.$    (4)

If $f$ is strictly convex, (4) holds for at most one point and there is at most one
solution. Notice that condition (4) reduces, for the subproblem (3) and any
step size $a > 0$, to

    $[\nabla f(x) + [a T(f,x)]^{-1}(x(a) - x)]'(y - x(a)) \geq 0, \quad \forall y \in X.$    (5)

The notion of 'gradient projection' in Algorithm 1 can be explained by the
observation that $x(a)$ in (3) coincides with the scaled projection on $X$,

    $x(a) \in \arg\min_{y \in X} \|y - z\|_{T(f,x)^{-1}}$    (6)

of the vector $z = x - a T(f,x) \nabla f(x)$ obtained by scaled gradient descent
from $x$. It follows from the convexity of $X$ and from the projection theorem
(Proposition 3.7 in Section 3.3 of the cited reference) that $x(a)$ is uniquely defined in (3) and (6).
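As an illustration of the projection step, the following minimal sketch covers a special case that is not singled out in the text: the feasible set is assumed to be a box and the scaling matrix is assumed diagonal, so that the scaled projection separates into coordinate-wise clipping. The function name `projected_step` and the numerical example are hypothetical.

```python
def projected_step(x, grad, t_diag, a, lower, upper):
    """One scaled gradient projection step x(a).

    Assumptions (for illustration only): X = [lower, upper] is a box and
    T = diag(t_diag), so the T^{-1}-weighted projection onto X reduces to
    clipping each coordinate independently.
    """
    # z = x - a * T * grad f(x): scaled gradient descent from x
    z = [xi - a * ti * gi for xi, ti, gi in zip(x, t_diag, grad)]
    # projection of z onto the box
    return [min(max(zi, lo), hi) for zi, lo, hi in zip(z, lower, upper)]


# Hypothetical example: f(x) = 0.5 * ((x1 - 2)^2 + (x2 + 1)^2) on [0, 1]^2.
x = [0.5, 0.5]
grad = [x[0] - 2.0, x[1] + 1.0]
x_new = projected_step(x, grad, t_diag=[1.0, 1.0], a=1.0,
                       lower=[0.0, 0.0], upper=[1.0, 1.0])
print(x_new)
```

With identity scaling and unit step size the unconstrained move lands at (2, -1), and clipping returns the corner (1, 0) of the box, which is the constrained minimizer of this particular example.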

Global convergence of (2) is commonly guaranteed by using an approximate
line search rule of the Armijo type [3], which consists of setting $a = a(f, x^k)$
where, for $x \in X$, $a(f, x)$ is defined as the largest $a \in \{\beta^m\}_{m=0}^{\infty}$ satisfying

    $f(x) - f(x(a)) \geq \sigma \|x(a) - x\|_{[a T(f,x)]^{-1}}^2.$    (7)

From [3] we know that the step sizes computed by (7) are restricted to a
set $[\underline{a}, 1]$, where $\underline{a} > 0$ is a function of the Lipschitz constant of $\nabla f$. When the
algorithm is appropriately designed, (7) becomes asymptotically trivial. This
is illustrated by the next result, shown in the Appendix, where $L$ denotes the
Lipschitz constant in the sense of (1).

Proposition 1 (Line search efficiency). Suppose that Algorithm 1 is
implemented with the step-size selection rule (7) and generates a sequence $\{x^k\}$
converging to a stationary point $x^*$. If

    $2(1 - \sigma) T(f, x)^{-1} \succeq L, \quad \forall x \in X,$    (8)

then $a(f, x^k) = 1$ for all $k$. If $T(f, \cdot)$ is continuous, $f$ is twice continuously
differentiable in a neighborhood of $x^*$, and

    $2(1 - \sigma) T(f, x^*)^{-1} \succ \nabla^2 f(x^*),$    (9)

then $a(f, x^k) \to 1$.
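The step-size rule can be sketched numerically. The code below is a hedged illustration only: it assumes that the sufficient-decrease test (7) reads $f(x) - f(x(a)) \geq \sigma \|x(a) - x\|^2$ in the norm scaled by $[aT(f,x)]^{-1}$, that the feasible set is a box, and that the scaling is diagonal. The function name `armijo_step` and the test problem are hypothetical.

```python
def armijo_step(f, grad_f, x, t_diag, lower, upper,
                beta=0.5, sigma=0.25, max_trials=50):
    """Largest a in {beta^m} passing the assumed test
    f(x) - f(x(a)) >= sigma * ||x(a) - x||^2 in the [a T]^{-1} norm.

    Box constraints and diagonal scaling are illustrative assumptions.
    """
    g = grad_f(x)
    a = 1.0
    x_a = list(x)
    for _ in range(max_trials):
        # projected trial point x(a)
        x_a = [min(max(xi - a * ti * gi, lo), hi)
               for xi, ti, gi, lo, hi in zip(x, t_diag, g, lower, upper)]
        # squared norm of x(a) - x weighted by [a T]^{-1}
        sq = sum((xa - xi) ** 2 / (a * ti)
                 for xa, xi, ti in zip(x_a, x, t_diag))
        if f(x) - f(x_a) >= sigma * sq:
            return a, x_a
        a *= beta
    return a, x_a


# Hypothetical 1-D example: f(x) = 2 x^2, so the Lipschitz constant is L = 4.
f = lambda x: 2.0 * x[0] ** 2
grad_f = lambda x: [4.0 * x[0]]

# Scaling t = 1/4 satisfies 2 (1 - sigma) / t >= L, so the unit step is accepted.
a_small, _ = armijo_step(f, grad_f, [1.0], [0.25], [-10.0], [10.0])
# Scaling t = 1 violates that condition, so the rule has to backtrack.
a_large, _ = armijo_step(f, grad_f, [1.0], [1.0], [-10.0], [10.0])
print(a_small, a_large)
```

In this example the first call keeps the unit step while the second backtracks twice, which mirrors the role of a condition of the form $2(1-\sigma)T^{-1} \succeq L$ in Proposition 1.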
2.2 Descent in polyhedral sets

Throughout the paper we consider the following problem. 

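Problem 1. Solve

    $\min_{x \in X} f(x)$    (10)

where $f \in \mathcal{F}(m)$, $\nabla f$ satisfies (1) with Lipschitz constant $L$, and $X$ is the
nonempty polyhedron $X := \{x \in \mathbb{R}^m \mid a_1' x \leq b_1, \; a_2' x \leq b_2, \; \dots, \; a_p' x \leq b_p\}$, with
$a_1, \dots, a_p \in \mathbb{R}^m$ and $b_1, \dots, b_p \in \mathbb{R}$.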

The affine constraint functions in Problem 1 can be rewritten as $c_j(x) \leq 0$,
where $c_j(x) := a_j' x - b_j$ and $\nabla c_j(x) = a_j$ for $x \in \mathbb{R}^m$ ($j = 1, \dots, p$). A
constraint $c_j(x) \leq 0$ is said to be inactive at a point $x \in \mathbb{R}^m$ if $c_j(x) < 0$, and
active if $c_j(x) = 0$, in which case we write $j \in \mathcal{A}(x)$, where $\mathcal{A}(x) \subset \{1, \dots, p\}$
denotes the index set of the active constraints at $x$. If a constraint qualification
holds for Problem 1 (e.g. Slater's condition [3]), then the first-order optimality
condition (4) for a point $x^* \in X$ translates, in accordance with the
Karush-Kuhn-Tucker (KKT) conditions, into the existence of nonnegative coefficients
$\{\alpha_j\}_{j \in \mathcal{A}(x^*)}$ satisfying

    $\nabla f(x^*) = -\sum_{j \in \mathcal{A}(x^*)} \alpha_j \nabla c_j(x^*).$    (11)

Frequently, a solution $x^* \in X$ of Problem 1 will meet the stronger condition
that (11) holds for positive coefficients $\{\alpha_j\}_{j \in \mathcal{A}(x^*)}$. In that case we say that
strict complementarity holds at $x^*$ (thus extending to polyhedra a notion
discussed in the context of bound-constrained optimization), and it follows
that $\mathcal{A}(x^*)$ is identified in finite time by the gradient projection algorithm.
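As a worked special case (an illustration under the assumption that the feasible set is the non-negative orthant, which is one instance of the polyhedra considered here), the KKT condition and strict complementarity take a coordinate-wise form:

```latex
% Illustrative special case: X = \{x \in \mathbb{R}^m \mid x \ge 0\},
% i.e. c_j(x) = -x_j and \nabla c_j(x) = -e_j for j = 1, \dots, m.
\begin{align*}
  \nabla f(x^*) &= -\sum_{j \in \mathcal{A}(x^*)} \alpha_j (-e_j)
                 = \sum_{j \in \mathcal{A}(x^*)} \alpha_j e_j , \\
  \frac{\partial f}{\partial x_j}(x^*) &= \alpha_j \ge 0
      \quad \text{if } x_j^* = 0, \qquad
  \frac{\partial f}{\partial x_j}(x^*) = 0
      \quad \text{if } x_j^* > 0 .
\end{align*}
% Strict complementarity then requires
% \partial f / \partial x_j (x^*) > 0 for every j with x_j^* = 0.
```

In this setting the reduced space at $x^*$ is spanned by the coordinate axes of the strictly positive components of $x^*$.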

Proposition 2 (Identification of the active constraints). Assume that
Problem 1 admits a solution $x^*$ where strict complementarity holds. Then one
can find a $\delta > 0$ such that $\mathcal{A}(\mathcal{G}(f, x)) = \mathcal{A}(x^*)$ for any $x \in X$ satisfying
$\|x - x^*\| < \delta$. Moreover, any sequence $\{x^k\}$ generated by Algorithm 1 and
converging towards $x^*$ is such that $\mathcal{A}(x^k) \to \mathcal{A}(x^*)$.

The proof is given in the Appendix. A consequence of Proposition 2 is that,
under strict complementarity at a point of convergence $x^*$, local convergence occurs
in a subspace $\{x \in \mathbb{R}^m \mid a_j' x = b_j, \; j \in \mathcal{A}(x^*)\}$ with dimension $\tilde{m} \leq m$,
called the reduced space at $x^*$, and orthogonal to the normal vectors of all the
active constraints at $x^*$. By $E(x^*)$ we denote any matrix whose columns form
an orthonormal basis of the reduced space at $x^*$. For any $x \in X$ such that
$\mathcal{A}(x) = \mathcal{A}(x^*)$, there is a unique vector $\tilde{x} \in \mathbb{R}^{\tilde{m}}$ satisfying $x = x^* + E(x^*)\tilde{x}$. The
following result states that the gradient projections reduce to mere gradient
descents in the vicinity of $x^*$ and derives asymptotics for $\mathcal{G}$. We denote
by $\tilde{I}$ the identity matrix in $\mathbb{R}^{\tilde{m} \times \tilde{m}}$.

Proposition 3 (Descent in the reduced space). Let $x^*$ be a solution of
Problem 1 where strict complementarity holds, and consider Algorithm 1. Any
vectors $x, y \in X$ such that $y = \mathcal{G}(f, x)$ with step size $a$, and $\tilde{x}, \tilde{y} \in \mathbb{R}^{\tilde{m}}$
such that $x = x^* + E(x^*)\tilde{x}$ and $y = x^* + E(x^*)\tilde{y}$, satisfy

    $\tilde{y} = \tilde{x} - a \tilde{T}(f, x) \tilde{\nabla} f(x),$    (12)

where $\tilde{\nabla} f := E(x^*)' \nabla f$ and $\tilde{T}(f, x) := [E(x^*)' T(f, x)^{-1} E(x^*)]^{-1}$.

Further, if $f$ is twice continuously differentiable and $T(f, \cdot)$ is continuous
at $x^*$, then

    $\tilde{y} = [\tilde{I} - a \tilde{T}(f, x^*) \tilde{\nabla}^2 f(x^*)]\, \tilde{x} + \rho(f, \tilde{x}),$    (13)

where $\tilde{\nabla}^2 f(x^*) := E(x^*)' \nabla^2 f(x^*) E(x^*)$ and $\rho(f, \tilde{x}) = o(\|\tilde{x}\|)$. If $f$ is smooth
and $T(f, \cdot)$ is continuously differentiable at $x^*$, then the remainder rewrites as
$\rho(f, \tilde{x}) = g(f, \tilde{x})(x - y)(x - y)'$, where $g(f, \tilde{x})$ is a function of the derivatives
at $x^*$ of $\nabla^2 f$ and $T(f, \cdot)$ uniformly bounded in a neighborhood of 0.


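Proof. Since the proposition is trivial when $\tilde{m} = 0$, we suppose that $\tilde{m} > 0$. Let
$X^* := \{x \in X \mid \mathcal{A}(x) = \mathcal{A}(x^*)\}$, and $\tilde{X} := \{E(x^*)'(x - x^*) \mid x \in X^*\}$, which
is an open subset of $\mathbb{R}^{\tilde{m}}$ by continuity of the constraint functions. Moreover,
the function $h(\tilde{x}) := x^* + E(x^*)\tilde{x}$ is a bijection between $\tilde{X}$ and $X^*$, i.e. $X^* =
\{h(\tilde{y}) \mid \tilde{y} \in \tilde{X}\}$. It follows from the assumptions on $x$ and $y$ that

    $\tilde{y} \in \arg\min_{\tilde{z} \in \tilde{X}} \left\{ \tilde{\nabla} f(x)'(\tilde{z} - \tilde{x}) + \tfrac{1}{2a}\|\tilde{z} - \tilde{x}\|_{\tilde{T}(f,x)^{-1}}^2 \right\}$    (14)

    $\phantom{\tilde{y}} = \tilde{x} - a \tilde{T}(f, x) \tilde{\nabla} f(x),$    (15)

where (15) follows from the fact that $\tilde{X}$ is an open set containing $\tilde{y}$, thus $\tilde{y}$ is
the projection on $\tilde{X}$ of only one point: itself.

The remaining statements follow directly from (12) and Taylor's theorem,
by linear approximation at $x^*$ of the displacement $d(x) := \mathcal{G}(f, x) - x$. □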


2.3 Parallel analysis and coordinate descent 

This section considers parallel implementations of Algorithm 1 for Problem 1,
where the assumption is made that $X$ is a Cartesian product set.

Assumption 1 (Parallel analysis). The feasible set of Problem 1 is given by
$X = X_1 \times \dots \times X_n$, where each $X_i$ is a polyhedron in $\mathbb{R}^{m_i}$ and $m_1 + \dots + m_n = m$.
The Lipschitz continuity of $\nabla f$ is considered coordinate-wise and (1) holds for
$L = \mathrm{diag}(L_1, \dots, L_n)$.

Assumption 1 implicitly defines a set $N = \{1, \dots, n\}$ of coordinate directions
with respective dimensions $m_1, \dots, m_n$, suggesting parallel optimization by
coordinate descent.

2.3.1 Coordinate descent 

In this study the optimization of $f \in \mathcal{F}(m)$ at a point $x \in X$ along a particular
coordinate direction $i \in N$ is symbolized by the function $f_{i;x} \in \mathcal{F}(m_i)$ obtained
from $f(x)$ by fixing the other coordinates, i.e.

    $f_{i;x}(y) := f(x_1, \dots, x_{i-1}, y, x_{i+1}, \dots, x_n), \quad \forall y \in \mathbb{R}^{m_i}, \; x \in X.$    (16)

Using (16), we consider in each direction $i$ a scaling mapping $T_i : \mathcal{F}(m_i) \times X_i \to
\mathcal{T}(m_i)$ and define the associated coordinate gradient projection mapping

    $\mathcal{G}_i(f, x) := \mathcal{G}(f_{i;x}, x_i), \quad \forall x \in X, \; i \in N,$    (17)

where $\mathcal{G}$ is applied with feasible set $X_i$ and scaling $T_i$.

Based on (17), we formulate a synchronous coordinate descent algorithm,
modeled on the Jacobi method, as $x^{k+1} = \mathcal{J}(f, x^k)$, where

    $\mathcal{J}(f, x) := (\mathcal{G}_1, \mathcal{G}_2, \dots, \mathcal{G}_n)(f, x), \quad \forall x \in X.$    (18)

Notice in (18) that the coordinate descents are applied simultaneously
along the $n$ directions. Global convergence of $\mathcal{J}$ depends on its
ability to guarantee sufficient descent of the considered function at each
step, which, in the general case, requires not only synchronism for the
applications of $\mathcal{G}_1, \dots, \mathcal{G}_n$, but also consensus at the global level on the choice of scaling
matrices and step sizes in each direction. An alternative is to process the
coordinate descents sequentially, using the directional mappings

    $\bar{\mathcal{G}}_i(f, x) := (x_1, \dots, x_{i-1}, \mathcal{G}_i(f, x), x_{i+1}, \dots, x_n), \quad \forall x \in X, \; i \in N.$    (19)

A cyclic coordinate descent algorithm can then be designed by applying the
coordinate descent mappings in a predefined order as in the Gauss-Seidel method,
i.e. $x^{k+1} = \mathcal{S}(f, x^k)$, where

    $\mathcal{S}(f, x) := (\bar{\mathcal{G}}_n \circ \dots \circ \bar{\mathcal{G}}_2 \circ \bar{\mathcal{G}}_1)(f, x), \quad \forall x \in X,$    (20)

and $\circ$ denotes the composition operator, defined for any $i, j \in N$ by
$(\bar{\mathcal{G}}_i \circ \bar{\mathcal{G}}_j)(f, x) := \bar{\mathcal{G}}_i(f, \bar{\mathcal{G}}_j(f, x))$. The global convergence of $\mathcal{S}$ is guaranteed by
approximate line search in each coordinate direction, i.e. using, for $i \in N$ and at
every $x \in X$, the step sizes $a(f_{i;x}, x_i)$ specified by (7). One shows that
convergence is conserved when the mappings $\bar{\mathcal{G}}_i$ are applied in random order provided
that each coordinate direction is visited an infinite number of times [7]. A
random coordinate descent algorithm is given by $x^{k+1} = \mathcal{R}^k(f, x^k)$, where

    $\mathcal{R}^k(f, x) := \bar{\mathcal{G}}_{\phi^k}(f, x), \quad \forall x \in X, \; k = 0, 1, 2, \dots,$    (21)

and $\{\phi^k\}$ is a sequence of coordinate directions randomly selected in $N$, so
that each $\phi^k$ is a realization of a discrete random variable defined on a
probability space $(N, 2^N, \pi)$, with $\pi = (\pi_1, \dots, \pi_n) \in (0,1)^n$. More sophisticated
parallel implementations of $\mathcal{G}$ involving (block-) coordinate selection routines
(e.g. Gauss-Southwell methods) can also be devised.
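The three modes of implementation can be compared on a toy instance. The sketch below is only an illustration under assumptions not made in the text: the cost is a hypothetical quadratic $f(x) = \tfrac{1}{2}x'Hx - b'x$, the feasible set is a box, every coordinate block is scalar, the step size is fixed to 1, and the scaling is $T_i = 1/H_{ii}$. All function names are hypothetical.

```python
import random


def grad_i(H, b, x, i):
    """i-th partial derivative of f(x) = 0.5 x'Hx - b'x."""
    return sum(H[i][j] * x[j] for j in range(len(x))) - b[i]


def coord_update(H, b, x, i, lo, hi):
    """Projected step along coordinate i with scaling 1/H_ii and unit step."""
    return min(max(x[i] - grad_i(H, b, x, i) / H[i][i], lo[i]), hi[i])


def jacobi_sweep(H, b, x, lo, hi):
    """Synchronous mode: every coordinate is updated from the same point x."""
    return [coord_update(H, b, x, i, lo, hi) for i in range(len(x))]


def cyclic_sweep(H, b, x, lo, hi):
    """Cyclic mode: coordinates are updated one after another (Gauss-Seidel)."""
    x = list(x)
    for i in range(len(x)):
        x[i] = coord_update(H, b, x, i, lo, hi)
    return x


def random_step(H, b, x, lo, hi, probs, rng):
    """Random mode: one coordinate is drawn with probabilities probs."""
    x = list(x)
    i = rng.choices(range(len(x)), weights=probs)[0]
    x[i] = coord_update(H, b, x, i, lo, hi)
    return x


# Hypothetical problem whose solution (0.5, 0) has one active bound.
H = [[2.0, 1.0], [1.0, 2.0]]
b = [1.0, -3.0]
lo, hi = [0.0, 0.0], [10.0, 10.0]

x_jac = x_cyc = x_rnd = [5.0, 5.0]
rng = random.Random(0)
for _ in range(10):
    x_jac = jacobi_sweep(H, b, x_jac, lo, hi)
    x_cyc = cyclic_sweep(H, b, x_cyc, lo, hi)
for _ in range(200):
    x_rnd = random_step(H, b, x_rnd, lo, hi, [0.5, 0.5], rng)
print(x_jac, x_cyc, x_rnd)
```

On this instance all three modes reach the same point, at which the second coordinate sits on its lower bound with a strictly positive partial derivative.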

For any parallel implementation of Algorithm 1 designed with coordinate
scaling we let $T(f, x) := \mathrm{diag}(T_1(f_{1;x}, x_1), \dots, T_n(f_{n;x}, x_n))$ for $x \in X$.

Proposition 1 extends to coordinate descent, and the step sizes computed
by (7) in each direction reduce to 1 if (8) holds [7]. If the scaling mappings
are continuous and $f$ is twice continuously differentiable, the Hessian $\nabla^2 f =
(\nabla^2_{ij} f)$ may be regarded as a block matrix and decomposed into
$\nabla^2 f = \nabla^2_D f - \nabla^2_L f - (\nabla^2_L f)'$,
where $\nabla^2_D f := \mathrm{diag}(\nabla^2_{11} f, \dots, \nabla^2_{nn} f)$ is block diagonal and $\nabla^2_L f$ is
strictly lower triangular. The step sizes of the parallel algorithms then reduce
to 1 in the vicinity of $x^*$ if $2(1 - \sigma) T_i(f_{i;x^*}, x_i^*)^{-1} \succ \nabla^2_{ii} f(x^*)$ holds for $i \in N$,
i.e. if

    $2(1 - \sigma) T(f, x^*)^{-1} \succ \nabla^2_D f(x^*).$    (22)

2.3.2 Asymptotics

By extension of Proposition 2 ([7]), one shows the existence of a reduced space
where every sequence generated by a parallel implementation of Algorithm 1
and converging to the solution $x^*$ accumulates under strict complementarity
for $f$ at $x^*$ or, equivalently, strict complementarity for $f_{i;x^*}$ at $x_i^*$ in every
direction $i \in N$. The reduced space at $x^* = (x_1^*, \dots, x_n^*)$ is a Cartesian product
and the matrix $E(x^*)$, introduced in Section 2.2, takes the block-diagonal
form $E(x^*) = \mathrm{diag}(E_1(x_1^*), \dots, E_n(x_n^*))$, where the columns of $E_i(x_i^*)$ form an
orthonormal basis of the reduced space at $x_i^*$ in the coordinate direction $i$.

Under Assumption 1, $\nabla f$ rewrites as the $n$-dimensional composite
vector $\nabla f = (\nabla_1 f, \dots, \nabla_n f)$. For $x \in X$, we define $\tilde{\nabla}_i f(x) := E_i(x_i^*)' \nabla_i f(x)$
($i \in N$). Similarly, setting $\tilde{T}(f, x) := [E(x^*)' T(f, x)^{-1} E(x^*)]^{-1}$ yields the
block-diagonal form $\tilde{T}(f, x) = \mathrm{diag}(\tilde{T}_1(f_{1;x}, x_1), \dots, \tilde{T}_n(f_{n;x}, x_n))$, where the
diagonal elements are given by $\tilde{T}_i(f_{i;x}, x_i) := [E_i(x_i^*)' T_i(f_{i;x}, x_i)^{-1} E_i(x_i^*)]^{-1}$.

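In the reduced space at $x^*$, a gradient projection $y = \mathcal{G}_i(f, x)$ with step
size $a$ in direction $i \in N$ at a point $x \in X$ near $x^*$ reduces, by translation of
Proposition 3 into the coordinate descent framework ([7]), to $x = x^* + E(x^*)\tilde{x}$
and $y = x_i^* + E_i(x_i^*)\tilde{y}$ for some $\tilde{x} \in \mathbb{R}^{\tilde{m}}$ and $\tilde{y} \in \mathbb{R}^{\tilde{m}_i}$ satisfying

    $\tilde{y} = \tilde{x}_i - a \tilde{T}_i(f_{i;x}, x_i) \tilde{\nabla}_i f(x), \quad \forall i \in N.$    (23)

If $f$ is twice continuously differentiable and $T(f, \cdot)$ is continuous at $x^*$,
then (23) asymptotically reduces to

    $\tilde{y} = (0, \dots, 0, \tilde{I}_i, 0, \dots, 0)\, [\tilde{I} - a \tilde{T}(f, x^*) \tilde{\nabla}^2 f(x^*)]\, \tilde{x} + o(\|\tilde{x}\|),$    (24)

where $\tilde{\nabla}^2 f(x) := E(x^*)' \nabla^2 f(x) E(x^*)$ and $\tilde{I}_i$ denotes the identity matrix
in $\mathbb{R}^{\tilde{m}_i \times \tilde{m}_i}$. Similarly, we write $\tilde{\nabla}^2 f = \tilde{\nabla}^2_D f - \tilde{\nabla}^2_L f - (\tilde{\nabla}^2_L f)'$, where
$\tilde{\nabla}^2_D f(x) := E(x^*)' \nabla^2_D f(x) E(x^*)$, and $\tilde{\nabla}^2_L f(x) := E(x^*)' \nabla^2_L f(x) E(x^*)$.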

3 Asymptotic convergence rates 

The developments of this section rely on the following assumption. 
Assumption 2. Problem 1 has a unique solution $x^*$ where strict
complementarity holds and in the vicinity of which $f$ is twice continuously differentiable.

Terminology. We derive asymptotic rates of convergence for the algorithms
of Section 2.3 by first-order sensitivity analysis around $x^*$. Our aim is to find a
matrix $H$ which satisfies an equation of the type $h(x^{k+1}) = H h(x^k) + o(h(x^k))$
for some continuous function $h$ and any sequence $\{x^k\}$ generated by the
considered algorithm. If this equation holds for $\rho(H) \leq 1$, where $\rho(\cdot)$ denotes the
spectral radius, we say that $\{h(x^k)\}$ converges towards $h(x^*)$ with asymptotic rate $H$.
Convergence is called sublinear if $\rho(H) = 1$, and linear if $\rho(H) < 1$. If $H$ satisfies
the inequality $h(x^{k+1}) \leq H h(x^k) + o(h(x^k))$ for any generated sequence and the
algorithm may produce a sequence for which the latter inequality holds with equality
sign, then we speak of convergence with limiting asymptotic rate $H$.

Proposition 4 extends to polyhedral sets and arbitrary scaling a result
derived for a cyclic coordinate descent algorithm used in the non-negative
orthant with coordinate-wise Newton scaling. The proof provided in this paper
is arguably simpler and the requirements less restrictive. The spectral radius
of any matrix $M$, defined as the supremum among the absolute values
of the eigenvalues of $M$, is denoted by $\rho(M)$.

Proposition 4 (Asymptotic convergence rate of $\mathcal{S}$). Let Assumptions 1
and 2 hold for Problem 1. Consider the cyclic algorithm $x^{k+1} = \mathcal{S}(f, x^k)$
implemented with the step-size selection rule (7) and with scaling $\{T_i(f_{i;x}, x_i)\}_{i \in N}$
continuous at $x^*$, and satisfying condition (22) for the step sizes. Any
sequence $\{x^k\}$ generated by the algorithm converges towards $x^*$ with asymptotic
rate $E(x^*) S(f, x^*)$, where

    $S(f, x^*) = [\tilde{T}(f, x^*)^{-1} - \tilde{\nabla}^2_L f(x^*)]^{-1} [\tilde{T}(f, x^*)^{-1} - \tilde{\nabla}^2_D f(x^*) + \tilde{\nabla}^2_L f(x^*)']$    (25)

with $\rho(S(f, x^*)) \leq 1$, while $|f(x^k) - f(x^*)|$ vanishes with limiting asymptotic
rate $\rho(S(f, x^*))^2$. If $\tilde{\nabla}^2 f(x^*)$ is positive definite, then $\rho(S(f, x^*)) < 1$.


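Proof. It follows from the assumptions, Proposition 1 and (24), that one can
find a $\bar{k} < \infty$ such that, for any $k \geq \bar{k}$, we have $x^k = x^* + E(x^*)\tilde{x}^k$ and
$x^{k+1} = x^* + E(x^*)\tilde{x}^{k+1}$ for some $\tilde{x}^k, \tilde{x}^{k+1} \in \mathbb{R}^{\tilde{m}}$, with

    $\tilde{x}^{k+1} = G_n(f, x^*)\, G_{n-1}(f, x^*) \cdots G_1(f, x^*)\, \tilde{x}^k + o(\|\tilde{x}^k\|),$    (26)

where

    $G_i(f, x^*) := \tilde{I} - \mathrm{diag}(0, \dots, 0, \tilde{I}_i, 0, \dots, 0)\, \tilde{T}(f, x^*) \tilde{\nabla}^2 f(x^*), \quad \forall i \in N,$    (27)

embodies the effect of a gradient projection along coordinate direction $i$.
Applying the corresponding lemma from the Appendix yields $G_n(f, x^*) \cdots G_1(f, x^*) = S(f, x^*)$.

Consider the sequence of function values $\{f(x^k)\}$. It follows from
Proposition 2 and (11) that, for $k$ large enough, we have $\nabla f(x^*)'(x^k - x^*) = 0$.
Setting $\hat{x}^k := \tilde{\nabla}^2 f(x^*)^{1/2} \tilde{x}^k$ and $\hat{S}(f, x^*) := \tilde{\nabla}^2 f(x^*)^{1/2} S(f, x^*) \tilde{\nabla}^2 f(x^*)^{-1/2}$, the
Taylor theorem yields

    $f(x^{k+1}) - f(x^*)$
    $= \tfrac{1}{2}(x^{k+1} - x^*)' \nabla^2 f(x^*) (x^{k+1} - x^*) + o(\|x^{k+1} - x^*\|^2)$    (28)
    $= \tfrac{1}{2}(\tilde{x}^{k+1})' \tilde{\nabla}^2 f(x^*) \tilde{x}^{k+1} + o(\|\tilde{x}^{k+1}\|^2)$    (29)
    $= \tfrac{1}{2}[S(f, x^*) \tilde{x}^k]' \tilde{\nabla}^2 f(x^*) S(f, x^*) \tilde{x}^k + o(\|\tilde{x}^k\|^2)$    (30)
    $= \tfrac{1}{2}\|\hat{S}(f, x^*) \hat{x}^k\|^2 + o(\|\hat{x}^k\|^2)$    (31)
    $\leq \tfrac{1}{2}\rho(S(f, x^*))^2 \|\hat{x}^k\|^2 + o(\|\hat{x}^k\|^2)$    (32)
    $= \rho(S(f, x^*))^2 (f(x^k) - f(x^*)) + o(\|x^k - x^*\|^2).$    (33)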

We now characterize $\rho(S(f, x^*))$. First assume that $\tilde{\nabla}^2 f(x^*)$ is positive
definite. Observe that $S(f, x^*) = (D - E)^{-1} E'$, where $D = 2\tilde{T}(f, x^*)^{-1} -
\tilde{\nabla}^2_D f(x^*)$ and $E = \tilde{T}(f, x^*)^{-1} - \tilde{\nabla}^2_D f(x^*) + \tilde{\nabla}^2_L f(x^*)$. Noting that $D - E - E' =
\tilde{\nabla}^2 f(x^*)$ is positive definite and $(D - E)$ is nonsingular, the Ostrowski-Reich
theorem [10, Theorem 3.12] states that $\rho(S(f, x^*)) < 1$ if $D \succ 0$, i.e. if

    $2\tilde{T}(f, x^*)^{-1} \succ \tilde{\nabla}^2_D f(x^*).$    (34)

Because (22) implies (34), we infer that $\rho(S(f, x^*)) < 1$, and the algorithm
converges linearly.

If, however, $\tilde{\nabla}^2 f(x^*)$ is only positive semi-definite, then by computing
under (22) the asymptotic rate for the function $f(x) + \tfrac{\epsilon}{2}\|x\|^2$ with Hessian $\nabla^2 f + \epsilon I$
and letting $\epsilon \to 0$, we find $\rho(S(f, x^*)) \leq 1$ by continuity of the eigenvalues
of $S(f, x^*)$ with respect to $\tilde{\nabla}^2 f(x^*)$, which completes the proof. □
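The identity between the product of the coordinate factors and the matrix $S$ can be checked numerically. The sketch below is an illustration under simplifying assumptions: no constraint is active (so the reduced space is the whole space), every block is scalar, and the Hessian is written $H = D - L - L'$ with $L$ strictly lower triangular, so that $S$ solves $(T^{-1} - L)S = T^{-1} - D + L'$. The matrix `H`, the scaling `t`, and all function names are hypothetical.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def coordinate_factor(H, t, i):
    """G_i = I - e_i e_i' T H: only row i differs from the identity."""
    n = len(H)
    G = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for j in range(n):
        G[i][j] -= t[i] * H[i][j]
    return G


def cyclic_rate(H, t):
    """Solve (T^{-1} - L) S = T^{-1} - D + L' by forward substitution."""
    n = len(H)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j == i:
                rhs = 1.0 / t[i] - H[i][i]   # diagonal entry of T^{-1} - D
            elif j > i:
                rhs = -H[i][j]               # entry of L' (strictly upper part)
            else:
                rhs = 0.0
            # entries of T^{-1} - L below the diagonal equal H[i][k], k < i
            rhs -= sum(H[i][k] * S[k][j] for k in range(i))
            S[i][j] = rhs * t[i]             # divide by the diagonal 1 / t[i]
    return S


# Hypothetical positive definite Hessian and diagonal scaling.
H = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
t = [0.2, 0.3, 0.4]

P = [[1.0 if r == c else 0.0 for c in range(3)] for r in range(3)]
for i in range(3):
    P = matmul(coordinate_factor(H, t, i), P)   # builds G_n ... G_1

S = cyclic_rate(H, t)
gap = max(abs(P[r][c] - S[r][c]) for r in range(3) for c in range(3))
print(gap)
```

The gap is at the level of rounding error, which is what the Gauss-Seidel reading of a cyclic sweep suggests: row $i$ of the update uses the already refreshed coordinates $j < i$ and the old coordinates $j \geq i$.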

Remark 1. Since condition (22) is the conjunction of $n$ conditions verifiable
along individual directions, i.e. $2(1 - \sigma) T_i(f_{i;x^*}, x_i^*)^{-1} \succ \nabla^2_{ii} f(x^*)$ for $i \in N$,
the cyclic algorithm $\mathcal{S}$ is an attractive candidate for distributed optimization.

Remark 2 (Coordinate-wise Newton scaling). When Newton scaling
is used in each direction, i.e. $T_i(f_{i;x}, x_i) = \nabla^2_{ii} f(x)^{-1}$ for $i = 1, \dots, n$ or,
equivalently, $T(f, x) = \nabla^2_D f(x)^{-1}$, (34) holds at the point of convergence,
and the asymptotic convergence rate (25) reduces to
$S(f, x^*) = [\tilde{\nabla}^2_D f(x^*) - \tilde{\nabla}^2_L f(x^*)]^{-1} \tilde{\nabla}^2_L f(x^*)'$.

Remark 3. In the case when condition (22) is not met and $a(f, \cdot)$ is
discontinuous at $x^*$, $\mathcal{S}$ proves to converge locally like a stable discrete-time
switching system defined by a rate set $\{S^{[\psi]}(f, x^*)\}_{\psi \in \Psi}$ with $\rho(S^{[\psi]}(f, x^*)) \leq 1$
(or $\rho(S^{[\psi]}(f, x^*)) < 1$ if $\tilde{\nabla}^2 f(x^*)$ is positive definite) for all $\psi \in \Psi$, reducing
for large $k$ to $\tilde{x}^{k+1} = S^{[\psi(k)]}(f, x^*) \tilde{x}^k + o(\|\tilde{x}^k\|)$, where $\psi(\cdot)$ is a switching
function.


3.1 Synchronous implementations 

When implemented with identical step sizes in all coordinate directions, the
synchronous algorithm $\mathcal{J}$ proves to be equivalent to $\mathcal{G}$ endowed with
block-diagonal scaling. We directly infer the asymptotic convergence rate
of $\mathcal{J}$ from (13), by setting $a = 1$ in Proposition 3.

Proposition 5 (Asymptotic convergence rate of $\mathcal{J}$). Let Assumptions 1
and 2 hold for Problem 1. Consider the synchronous algorithm $x^{k+1} = \mathcal{J}(f, x^k)$
implemented with step size 1 in every coordinate direction and with
scaling $\{T_i(f_{i;x}, x_i)\}_{i \in N}$ continuous at $x^*$. If $\{x^k\}$ is a sequence generated by the
algorithm converging towards $x^*$, then $\{x^k\}$ converges with asymptotic rate

    $J(f, x^*) = \tilde{I} - \tilde{T}(f, x^*) \tilde{\nabla}^2 f(x^*).$    (35)

Remark 4. In contrast with Remark 1, global and linear convergence of $\mathcal{J}$
may be difficult to assess by inspection along individual directions. In the
particular case when $m_i = 1$ for $i \in N$ and (22) is satisfied, then $\rho(J(f, x^*))$ is
jointly characterized by Proposition 4 and the Stein-Rosenberg theorem [10,
Theorem 3.8], which claim that either $\rho(S(f, x^*)) = \rho(J(f, x^*)) = 1$, or
$\rho(S(f, x^*)) < \rho(J(f, x^*)) < 1$ when $\tilde{\nabla}^2 f(x^*)$ is positive definite, in which
case any convergent sequence $\{x^k\}$ generated by the algorithm $\mathcal{J}$ converges
linearly. Notice that convergence is then asymptotically faster for the cyclic
implementation $\mathcal{S}$ than for the synchronous implementation $\mathcal{J}$.
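The ordering between the two rates can be made explicit on a small worked example. The computation below assumes a two-dimensional problem with no active constraint, scalar blocks, and coordinate-wise Newton scaling; the entries $\alpha$, $\beta$, $\gamma$ of the Hessian are hypothetical symbols introduced for the illustration.

```latex
% Illustrative 2x2 case with Newton scaling T = \nabla^2_D f(x^*)^{-1}.
\begin{align*}
  \nabla^2 f(x^*) &=
    \begin{pmatrix} \alpha & \beta \\ \beta & \gamma \end{pmatrix},
    \qquad \alpha, \gamma > 0, \quad \beta^2 < \alpha\gamma, \\
  J &= I - \nabla^2_D f(x^*)^{-1} \nabla^2 f(x^*)
     = \begin{pmatrix} 0 & -\beta/\alpha \\ -\beta/\gamma & 0 \end{pmatrix},
    & \rho(J) &= \frac{|\beta|}{\sqrt{\alpha\gamma}}, \\
  S &= \begin{pmatrix} \alpha & 0 \\ \beta & \gamma \end{pmatrix}^{-1}
       \begin{pmatrix} 0 & -\beta \\ 0 & 0 \end{pmatrix}
     = \begin{pmatrix} 0 & -\beta/\alpha \\ 0 & \beta^2/(\alpha\gamma) \end{pmatrix},
    & \rho(S) &= \frac{\beta^2}{\alpha\gamma} = \rho(J)^2 .
\end{align*}
```

In this example positive definiteness gives $\rho(J) < 1$, hence $\rho(S) = \rho(J)^2 \leq \rho(J)$, with strict inequality whenever $\beta \neq 0$.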


Synchronous algorithms based on approximations of the Newton method.
A particular approach explored in recent studies is to find a compromise
between the computational and organizational attractiveness of the coordinate
descent methods, which set $T(f, x) = \nabla^2_D f(x)^{-1}$ for $x \in X$ under strong
convexity of $f$ and converge linearly, and the quadratic convergence of the
centralized Newton method, for which $T(f, x) = \nabla^2 f(x)^{-1}$. In these
studies $\nabla^2 f(x)$ is assumed to be sparse and such that the quantity $Q(f, x) :=
\nabla^2_D f(x)^{-1/2} [\nabla^2_L f(x) + \nabla^2_L f(x)'] \nabla^2_D f(x)^{-1/2}$ can be computed in a distributed
manner, while the inverse of the Hessian of $f$ rewrites as the series

    $\nabla^2 f(x)^{-1} = \nabla^2_D f(x)^{-1/2} \left[\sum_{t=0}^{\infty} Q(f, x)^t\right] \nabla^2_D f(x)^{-1/2},$    (36)

provided that $\rho(Q(f, x)) < 1$, which holds under a strict diagonal dominance
condition for $\nabla^2_D f(x)^{-1/2} \nabla^2 f(x) \nabla^2_D f(x)^{-1/2}$ in virtue of the Gershgorin circle
theorem. The suggested approach is to generate vector sequences
such that $x^{k+1} = \mathcal{G}(f, x^k)$, with scaling strategy

    $T(f, x) = \nabla^2_D f(x)^{-1/2} \left[\sum_{t=0}^{q} Q(f, x)^t\right] \nabla^2_D f(x)^{-1/2},$    (37)

where $q$ is a parameter symbolizing the implementability vs. rapidity
trade-off, and directly proportional to the computational complexity of the
algorithm. By setting (37) in (35), we obtain for $\{x^k\}$ the asymptotic convergence
rate $E(x^*) \tilde{Z}^{[q]}(f, x^*)$, where

    $\tilde{Z}^{[q]}(f, x^*) = \tilde{T}(f, x^*) E(x^*)' T(f, x^*)^{-1} Z^{[q]}(f, x^*) E(x^*)$    (38)

and $Z^{[q]}(f, x^*) := \nabla^2_D f(x^*)^{-1/2} Q(f, x^*)^{q+1} \nabla^2_D f(x^*)^{1/2}$ is the asymptotic
convergence rate for the unconstrained problem (i.e. $X = \mathbb{R}^m$). It can be seen that
$\rho(Z^{[q]}(f, x^*))$ vanishes with growing $q$. When $q = 0$, (38) reduces to the rate
of $\mathcal{J}$ with coordinate-wise Newton scaling.
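The expression of the unconstrained rate follows from a telescoping sum. The short derivation below is a sketch under the assumption that the Hessian is split as $\nabla^2 f = D - L - L'$ with $D$ block diagonal and $L$ strictly lower triangular, and that the scaling is the series truncated at order $q$; the shorthand $D$, $L$, $Q$, $T$ (arguments dropped) is introduced only for this illustration.

```latex
% Shorthand: D = \nabla^2_D f(x^*), Q = D^{-1/2}(L + L')D^{-1/2},
% so that \nabla^2 f(x^*) = D^{1/2}(I - Q)D^{1/2}.
\begin{align*}
  I - T\,\nabla^2 f(x^*)
    &= I - D^{-1/2}\Big[\sum_{t=0}^{q} Q^t\Big] D^{-1/2}
           \, D^{1/2}(I - Q)D^{1/2} \\
    &= I - D^{-1/2}\big(I - Q^{q+1}\big)D^{1/2} \\
    &= D^{-1/2}\, Q^{q+1}\, D^{1/2} .
\end{align*}
% The spectral radius of the result is \rho(Q)^{q+1},
% which vanishes as q grows whenever \rho(Q) < 1.
```

For $q = 0$ the scaling is $D^{-1}$ and the rate is $D^{-1/2} Q D^{1/2} = D^{-1}(L + L')$, the Jacobi iteration matrix associated with coordinate-wise Newton scaling.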


3.2 Random implementations 

We consider the asymptotic convergence of the random algorithm $\{\mathcal{R}^k\}$ given
in (21) and used with probabilities $\phi^k \sim \pi = (\pi_1, \dots, \pi_n)$ for the coordinate
directions. In this context we formulate a strong convexity assumption.

Assumption 3. The function $f$ is strongly convex so that there exists a
symmetric, positive definite block matrix $U = (U_{ij})$ with $U_{ij} \in \mathbb{R}^{m_i \times m_j}$ satisfying

    $[\nabla f(x) - \nabla f(y)]'(x - y) \geq \|x - y\|_U^2$ for every $x, y \in X$.

In polyhedral feasible sets, the local convergence of a generated
sequence $\{x^k\}$ converging to a solution $x^*$ where strict complementarity holds occurs
in the reduced space at $x^*$. In that case we find from (24) and for $k$ large,
$x^k = x^* + E(x^*)\tilde{x}^k$ and $x^{k+1} = x^* + E(x^*)\tilde{x}^{k+1}$, with $\tilde{x}^k, \tilde{x}^{k+1} \in \mathbb{R}^{\tilde{m}}$ and

    $\tilde{x}^{k+1} = G_{\phi^k}(f, x^*)\, \tilde{x}^k + o(\|\tilde{x}^k\|).$    (39)

The expectation of (39) in $\phi^k$ gives $\mathbb{E}[\tilde{x}^{k+1} \mid \tilde{x}^k, \theta^k] = R(f, x^*) \tilde{x}^k + o(\|\tilde{x}^k\|)$,
where $R(f, x^*) := \tilde{I} - \mathrm{diag}(\pi_1 \tilde{I}_1, \dots, \pi_n \tilde{I}_n)\, \tilde{T}(f, x^*) \tilde{\nabla}^2 f(x^*)$ and $\theta^k$ symbolizes
the event that $\{x^t\}_{t \geq k}$ is confined in the reduced space at $x^*$. In order to derive
asymptotic convergence rates for $\{\mathcal{R}^k\}$, we need to find out what happens
when $\theta^k$ is false, ideally making sure that $[1 - \Pr(\theta^k)]\, \mathbb{E}[h(x^{k+1}) \mid h(x^k), \neg\theta^k] =
o(h(x^k))$ for some residual $h(\cdot)$. This information can be partially inferred from
the following lemma, which extends to arbitrary distributions a convergence
result derived in the literature (Theorem 5 of the cited reference) for the algorithm
known as UCDM, which is a version of $\{\mathcal{R}^k\}$ using fixed scaling in each direction
and equal probabilities $\pi_i = \tfrac{1}{n}$ for all directions $i \in N$.

Lemma 1 (Convergence of $\{\mathcal{R}^k\}$). Assume that Problem 1 has a unique
solution $x^*$ and that the feasible set is the Cartesian product $X = X_1 \times \dots \times X_n$.
Consider the random algorithm $x^{k+1} = \mathcal{R}^k(f, x^k)$ implemented with $\phi^k \sim \pi =
(\pi_1, \dots, \pi_n)$ at every step $k$, the step-size selection rule (7) where $\sigma \leq \tfrac{1}{2}$, and
fixed scaling $T(f, x) = T = \mathrm{diag}(T_1, \dots, T_n)$ with $T_i \preceq L_i^{-1}$ for $i \in N$. Define

    $\Psi : x \in X \mapsto \Psi(x) := \tfrac{1}{2}\|x - x^*\|_V^2 + f(x) - f(x^*) \in \mathbb{R}_{\geq 0},$    (40)

where $\underline{\pi} := \min\{\pi_1, \dots, \pi_n\}$ and $V := [n\, \mathrm{diag}(\pi_1 T_1, \dots, \pi_n T_n)]^{-1}$. For any
sequence $\{x^k\}$ generated by the algorithm, we have

E [/(x'^) -/(x*)] < ^^tf'(x°), fc = 0,1,2,.... (41) 

If, in addition, $f$ is strongly convex as in Assumption 3, then

    $\mathbb{E}[\Psi(x^{k+1}) \mid x^k] \leq \left(1 - \frac{2\underline{\pi}u}{u + n\underline{\pi}}\right) \Psi(x^k), \quad k = 0, 1, 2, \dots,$    (42)

where the constant $u > 0$ satisfies $uV \preceq U$.

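The proof is similar to that of the original result (Theorem 5 of the cited
reference) and reported in the Appendix.
We are now able to characterize the convergence of the algorithm in polyhedra.

Proposition 6 (Asymptotic convergence of $\{\mathcal{R}^k\}$). Let Assumptions 1
and 2 hold for Problem 1. Consider the random algorithm $x^{k+1} = \mathcal{R}^k(f, x^k)$
implemented with $\phi^k \sim \pi = (\pi_1, \dots, \pi_n)$ at all $k$, the step-size selection rule (7)
where $\sigma \leq \tfrac{1}{2}$, and fixed scaling $T(f, x) = T = \mathrm{diag}(T_1, \dots, T_n)$ with $T_i \preceq
L_i^{-1}$ for $i \in N$. For any sequence $\{x^k\}$ generated by the algorithm, $\mathbb{E}[f(x^k)]$
converges towards $f(x^*)$ with limiting asymptotic rate

    $R^{[f]}(f, x^*) := \rho\left(\sum_{i=1}^{n} \pi_i\, G_i(f, x^*)' \tilde{\nabla}^2 f(x^*)\, G_i(f, x^*)\, \tilde{\nabla}^2 f(x^*)^{-1}\right) \leq 1.$    (43)

If $\tilde{\nabla}^2 f(x^*)$ is positive definite, then $R^{[f]}(f, x^*) < 1$.

Moreover, if $f$ is strongly convex as in Assumption 3, then $\mathbb{E}[\Psi(x^k)]$
vanishes at least linearly with limiting asymptotic rate

    $R^{[\Psi]}(f, x^*) := \rho\left(\sum_{i=1}^{n} \pi_i\, G_i(f, x^*)' [\tilde{V} + \tilde{\nabla}^2 f(x^*)]\, G_i(f, x^*)\, [\tilde{V} + \tilde{\nabla}^2 f(x^*)]^{-1}\right),$    (44)

where $\tilde{V} = E(x^*)' V E(x^*)$, and $\Psi$ and $V$ are defined as in (40).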

Proof. Let $\{x^k\}$ be a sequence generated by the algorithm. Proposition 2
claims that one can find a $\delta > 0$ such that $\mathcal{A}(x^t) = \mathcal{A}(x^*)$ for $t \geq k + 1$
when $\|x^k - x^*\| < \delta$. If $f^\delta := \max\{f(x) \mid \|x - x^*\| \leq \delta, \; x \in X\}$ and $\theta^k$ is
defined as above as the event that $\mathcal{A}(x^t) = \mathcal{A}(x^*)$ for $t \geq k$, it follows from
Lemma 1 that

Prle'-^) > Pr(||x‘ - x*|l < «) > Prl/Cx'-) < f ) f 

(45) 

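Hence $\Pr(\theta^{k+1}) \to 1$. From Proposition 1 we also know that the step sizes are
equal to 1, and from (24) that, when $\theta^k$ is true, then $x^k = x^* + E(x^*)\tilde{x}^k$ and
$x^{k+1} = x^* + E(x^*)\tilde{x}^{k+1}$ for some $\tilde{x}^k, \tilde{x}^{k+1} \in \mathbb{R}^{\tilde{m}}$ satisfying (39).

Consider the sequence of function values $\{f(x^k)\}$ and a step $k$. If $\theta^k$ is true,
we have $\nabla f(x^*)'(x^k - x^*) = 0$. By setting $\hat{x}^k := \tilde{\nabla}^2 f(x^*)^{1/2} \tilde{x}^k$ and $\hat{G}_i(f, x^*) :=
\tilde{\nabla}^2 f(x^*)^{1/2} G_i(f, x^*) \tilde{\nabla}^2 f(x^*)^{-1/2}$ for $i \in N$, and proceeding as in (28)-(33), we
find

    $f(x^{k+1}) - f(x^*) = \tfrac{1}{2}(\hat{x}^k)' \hat{G}_{\phi^k}(f, x^*)' \hat{G}_{\phi^k}(f, x^*)\, \hat{x}^k + o(\|\hat{x}^k\|^2).$    (46)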

Thus,

    $\mathbb{E}[f(x^{k+1}) - f(x^*) \mid x^k, \theta^k]$
    $= \tfrac{1}{2}(\hat{x}^k)' \left[\sum_{i=1}^{n} \pi_i\, \hat{G}_i(f, x^*)' \hat{G}_i(f, x^*)\right] \hat{x}^k + o(\|\hat{x}^k\|^2)$    (47)
    $\leq R^{[f]}(f, x^*)\, [f(x^k) - f(x^*)] + o(f(x^k) - f(x^*)),$    (48)

where $R^{[f]}(f, x^*)$ is given by (43). If, however, $\theta^k$ is false, then $[f(x^{k+1}) -
f(x^*)] \leq [f(x^k) - f(x^*)]$ by (7). All in all, we have

    $\mathbb{E}[f(x^{k+1}) - f(x^*) \mid x^k] \leq R^{[f]}(f, x^*)\, [f(x^k) - f(x^*)] + v^k + o(f(x^k) - f(x^*)),$    (49)

where $v^k = [1 - \Pr(\theta^k)]\, [1 - R^{[f]}(f, x^*)]\, [f(x^k) - f(x^*)] = o(f(x^k) - f(x^*))$, and
the rate $R^{[f]}(f, x^*)$ is tight.

Let Assumption 3 hold. Similarly, one can write $\tfrac{1}{2}\|x^{k+1} - x^*\|_V^2 =
\tfrac{1}{2}(\tilde{x}^{k+1})' \tilde{V} \tilde{x}^{k+1}$ when $\theta^k$ is true. Using (21) and (45), and bounding the
case when $\theta^k$ is false by (42), one finds

    $\mathbb{E}[\Psi(x^{k+1}) \mid x^k] \leq R^{[\Psi]}(f, x^*)\, \Psi(x^k) + v^k + o(\Psi(x^k)),$    (50)

where $v^k = [1 - \Pr(\theta^k)]\, [1 - 2\underline{\pi}u(u + n\underline{\pi})^{-1} - R^{[\Psi]}(f, x^*)]\, \Psi(x^k) = o(\Psi(x^k))$,
and $R^{[\Psi]}(f, x^*)$ is given by (44) and tight. It follows from Lemma 1 and (50)
that $R^{[\Psi]}(f, x^*) < 1$, and thus $R^{[f]}(f, x^*) < 1$. Otherwise there would exist a
vector $x^* + E(x^*)\epsilon\tilde{y} \in X$ such that $R^{[\Psi]}(f, x^*) \geq 1$ and (50) holds with
equality sign, which contradicts (42) if we take $\epsilon$ small enough.

Assume now that $f$ is not necessarily strongly convex, yet $\tilde{\nabla}^2 f(x^*)$ is
positive definite. Because $R^{[f]}(f, x^*)$ depends only on local properties of $f$,
applying the same algorithm within $X$ to a strongly convex function $g$ with the same
derivative and Hessian as $f$ in a neighborhood of $x^*$ will see $\mathbb{E}[g(x^{k+1}) - g(x^*)]$
converge with asymptotic rate $R^{[f]}(f, x^*) < 1$. When $\tilde{\nabla}^2 f(x^*)$ is positive
semi-definite, we find $R^{[f]}(f, x^*) \leq 1$ by considering the function $f + \tfrac{\epsilon}{2}\|x\|^2$ and
using the same continuity arguments as in the proof of Proposition 4. □


Remark 5. Let us compare the rates given in (42) and (44). For this purpose
we place ourselves in the conditions which optimize the precision of (42) by
supposing (i) that the bound $u$ is tight, in the sense that $uV \preceq U$ is satisfied
with equality sign (thus implying that $U$ is block diagonal), (ii) that $\nabla^2 f(x^*)$
is defined and equal to $U$ so that the strong convexity constant $U$ is itself tight
and we have the constraint $u \leq n\underline{\pi}$ imposed by $T \preceq L^{-1} \preceq U^{-1}$, and (iii) that
$\mathcal{A}(x^*) = \emptyset$. After computations, we find $R^{[\Psi]}(f, x^*) = 1 - \tfrac{u}{n}\left(2 - \tfrac{u}{n\underline{\pi}}\right)$. The ratio
with the rate (42) gives

    $\dfrac{1 - R^{[\Psi]}(f, x^*)}{2\underline{\pi}u / (u + n\underline{\pi})} = 1 + \dfrac{u(n\underline{\pi} - u)}{2(n\underline{\pi})^2} \geq 1.$    (51)

It can be inferred from (51) that the rate (42) is (for this problem) generally
conservative, and equal to the asymptotic rate $R^{[\Psi]}(f, x^*)$ iff $u = n\underline{\pi}$ holds,
i.e., when we use a scaled version $T_i = \tfrac{\underline{\pi}}{\pi_i} \nabla^2_{ii} f(x^*)^{-1}$ ($i \in N$) of the
asymptotic expression of the coordinate-wise Newton scaling approach previously
discussed in Remark 2; in that case the convergence rate reduces to $1 - \underline{\pi}$.
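The ratio between the two rates can be checked with a few lines of arithmetic. The sketch below assumes the closed forms $1 - \tfrac{u}{n}(2 - \tfrac{u}{n\underline{\pi}})$ for the asymptotic rate and $1 - 2\underline{\pi}u/(u + n\underline{\pi})$ for the global rate, under conditions (i)-(iii) of the remark; the function names and numerical values are hypothetical.

```python
def asymptotic_rate(u, n, pi_min):
    """Assumed closed form of the asymptotic rate under conditions (i)-(iii)."""
    return 1.0 - (u / n) * (2.0 - u / (n * pi_min))


def global_rate(u, n, pi_min):
    """Assumed form of the global contraction factor of the lemma."""
    return 1.0 - 2.0 * pi_min * u / (u + n * pi_min)


def ratio_lhs(u, n, pi_min):
    """Ratio of the two convergence margins (one minus each rate)."""
    return (1.0 - asymptotic_rate(u, n, pi_min)) / (1.0 - global_rate(u, n, pi_min))


def ratio_rhs(u, n, pi_min):
    """Right-hand side 1 + u (n pi - u) / (2 (n pi)^2)."""
    return 1.0 + u * (n * pi_min - u) / (2.0 * (n * pi_min) ** 2)


# Hypothetical values: n = 4 directions, smallest probability 0.1, u = 0.2.
n, pi_min, u = 4, 0.1, 0.2
lhs, rhs = ratio_lhs(u, n, pi_min), ratio_rhs(u, n, pi_min)

# Boundary case u = n * pi_min, where both rates are expected to coincide.
u_tight = n * pi_min
tight_asym = asymptotic_rate(u_tight, n, pi_min)
tight_glob = global_rate(u_tight, n, pi_min)
print(lhs, rhs, tight_asym, tight_glob)
```

For these values the ratio exceeds 1, so the global bound is conservative, while at $u = n\underline{\pi}$ both expressions reduce to $1 - \underline{\pi}$.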


3.3 Non-twice differentiable cost functions 

In the previous sections we have assumed that $\nabla^2 f$ existed at the point of
convergence $x^*$. Suppose instead that $\nabla^2 f(x^*)$ is not defined but that $f$ satisfies a
strong convexity condition at least locally in a neighborhood $X^*$ of $x^*$, i.e. there
is a symmetric, positive definite matrix $U$ such that $[\nabla f(x) - \nabla f(y)]'(x - y) \geq
\|x - y\|_U^2$ holds for $x, y \in X^*$.

Under Assumption |D consider any algorithm based on such as 

those introduced in Section EH and generate a sequence {x^} with step-size 
selection rule asymptotically efficient in the sense of Proposition 1. Assume that strict complementarity holds at $x^*$, so that convergence occurs in the reduced space at $x^*$ and, for large $k$, we have $x^k = x^* + E(x^*)\tilde{x}^k$ for some reduced vector $\tilde{x}^k$. In view of Remark 3 and by considering directional derivatives of $\nabla f$ in the developments that lead to (33), we find

&(/, x'^) -x*= - X*) + oiWx'^ - x*||) (52) 

where G'^ := E{x*)[i - diag{0, ...,0)f{f,x*)E{x*yM’^E{x*)] for 

some matrix $M^k \in \Xi$, where $\Xi$ denotes the set of all the symmetric matrices $M$ satisfying $U \preceq M \preceq L$. Suppose now that, for any strongly convex function $g$ which realizes its minimum on $X$ at $x^*$ and satisfies Assumption 3, the algorithm produces sequences $\{y^k\}$ linearly convergent towards $x^*$ with respect to some residual $h(y^k)$ and with rate $H(\nabla^2 g(x^*))$, i.e.

$$h(y^{k+1}) \le H(\nabla^2 g(x^*))\,h(y^k) + o(h(y^k)), \qquad (53)$$

where $\rho(H(M)) < 1$ for any $M \in \Xi$, and $H(\cdot)$ is a continuous mapping. It follows from the compactness of $\Xi$ that we can find a matrix $\bar{H}$ such that $\rho(\bar{H}) = \max_{M \in \Xi}\{\rho(H(M))\} < 1$ and

$$h(x^{k+1}) \le \bar{H}\,h(x^k) + o(h(x^k)). \qquad (54)$$


3.4 Stochastic optimization based on gradient projections

Some stochastic optimization problems are concerned with the minimization of a cost function unknown in closed form that can only be estimated through measurement or simulation. Assume in Problem 1 that $f$ is unknown, while a sequence $\{f^k\}$ of estimates in F(m) is available for $f$ with common Lipschitz constant for every $\nabla f^k$, and that $\{f^k\}$ converges almost surely towards $f$ in the sense that $\left[\sup_{x \in C} |f^k(x) - f(x)| + \sup_{x \in C} \|\nabla f^k(x) - \nabla f(x)\|\right]$ vanishes almost surely for any compact set $C \subset X$. An approach for solving this problem consists of sequentially applying an iterative optimization algorithm $\mathcal{M}$ along the sequence of function estimates, i.e.

$$x^{k+1} = \mathcal{M}(f^k, x^k), \quad k = 0, 1, 2, \dots \qquad (55)$$
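The sequential scheme (55) can be sketched on a toy problem. The example below is a minimal illustration under stated assumptions, not the algorithm analyzed in the paper: the feasible set is the box $[0,1]^2$, the measurement is the hypothetical $g(x,\omega) = \frac12\|x-\omega\|^2$ with Gaussian $\omega$, each estimate $f^k$ is a sample average over $q(k) = k+1$ draws, and $\mathcal{M}$ is a plain gradient projection with a fixed step.

```python
import random

def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(hi, max(lo, xi)) for xi in x]

def run_sequential_gradient_projection(mu=(0.5, 2.0), iters=2000, step=0.5, seed=0):
    """Apply x^{k+1} = M(f^k, x^k) along sample-average estimates f^k.

    Assumed toy model: f(x) = E[0.5 * ||x - omega||^2], omega ~ N(mu, I),
    so the minimizer of f on the box is the projection of mu onto the box.
    """
    rng = random.Random(seed)
    x = [0.0, 0.0]
    sums = [0.0, 0.0]
    for k in range(iters):
        # One new realization per iteration: f^k averages q(k) = k + 1 samples.
        omega = [rng.gauss(m, 1.0) for m in mu]
        sums = [s + w for s, w in zip(sums, omega)]
        mean = [s / (k + 1) for s in sums]
        # The gradient of f^k at x is x minus the current sample mean.
        grad = [xi - mi for xi, mi in zip(x, mean)]
        x = project_box([xi - step * gi for xi, gi in zip(x, grad)])
    return x

if __name__ == "__main__":
    print(run_sequential_gradient_projection())
```

With the default parameters the second coordinate settles on the active bound while the first keeps fluctuating with the sample mean, which is the behaviour the reduced-space analysis of this section describes.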

The bounded sequences {x'^} generated by (1531) are known to converge almost 
surely towards a nonempty solution set provided that A4 is closed and a descent 
algorithm with respect to / and the set of solutions m Theorem 2.1]. Possible 
choices for A4 include the gradient projection algorithm and (under 

Assumption [ij the parallel implementations of Section [2.31 whose convergence 
in stochastic settings is specifically addressed in m- 

Consider such an algorithm $\mathcal{M}$, and suppose that strict complementarity holds at $x^*$ (Assumption 3) and that each function $f^k$ has a unique minimizer $y^k$ on $X$ where $\nabla^2 f^k$ is defined, continuous and positive definite at $y^k$.

By Lipschitz continuity of $f$, the sequence $\{y^k\}$ is such that $A(y^k) \to A(x^*)$, i.e. $y^k = x^* + E(x^*)\tilde{y}^k$ for some reduced vector $\tilde{y}^k$ and, say, $k \ge \bar{k}$, with strict complementarity holding at $y^k$ for $f^k$. Assume that the considered algorithm $\mathcal{M}$ produces, when applied to any $f^k$ with $k \ge \bar{k}$, sequences in $X$ converging towards $y^k$ in the subspace at $x^*$ with rate $M(f^k, y^k) < 1$. Any bounded sequence $\{x^k\}$ generated by (55) will then be such that $x^k = x^* + E(x^*)\tilde{x}^k$ and $x^{k+1} = x^* + E(x^*)\tilde{x}^{k+1}$ for $k$ large enough and for some reduced vectors $\tilde{x}^k, \tilde{x}^{k+1}$ satisfying

$$\tilde{x}^{k+1} - \tilde{y}^k = M(f^k, y^k)(\tilde{x}^k - \tilde{y}^k) + \rho(f^k, \tilde{x}^k), \qquad (56)$$

where p(/^, x) = o(||a; — y^\\) for k >k. Further, if / and all are smooth, the 
scale T{f,-) of Af is continuously differentiable at x*, and, almost surely, 
and its derivatives converge uniformly on a neighborhood of x* towards V^/ 
and its derivatives, respectively, then p{f^,x^) = g{f'^,x^){x^ — y^)[x^ — 
y^Y, where p(/*,x^) is a function of derivatives at x* of V^/ and T{f, •) and 
uniformly bounded for all /c > fc on a neighborhood of x* in accordance with 
Proposition [21 Then, (1^ rewrites (with probability one) as 

^k+i _ ^ ^ ^k^yk _ ^ (57) 

where A'^ = E{x*)M{f'^, y^)E{x*y, and = E{x*){i - M{f^, y^))E{x*)’. 

The asymptotics of $\{y^k - x^*\}$ ensue from the nature of the function sequence $\{f^k\}$. In many problems, $f$ is an expectation function of the type

$$f(x) = \int_\Omega g(x,\omega)\,P(d\omega), \quad \forall x \in \mathbb{R}^m, \qquad (58)$$

where $\omega$ is a random parameter defined on a probability space $(\Omega, \mathcal{F}, P)$, and $g(\cdot,\omega)$ serves as a random measurement of $f$, modeling for instance the optimal value of the second-stage problem of a two-stage stochastic program m-. Based on (58) and the simulation of a sequence of samples $\omega^1, \dots, \omega^{q(k)}$ of independent realizations of $\omega$, with $q(k) \to \infty$ as $k \to \infty$, it is common to consider the sample average approximation (SAA)

$$f^k(x) = \frac{1}{q(k)} \sum_{j=1}^{q(k)} g(x, \omega^j), \quad k = 0, 1, 2, \dots, \qquad (59)$$

which converges almost surely and uniformly towards $f$ on any compact set $C \subset X$ under certain continuity and integrability conditions for $g$ [18]. The sequence $\{y^k\}$ is then known as the (SAA) estimator, and it follows from the central limit theorem that (59) is asymptotically normal, i.e.

$$q(k)^{1/2}\,[f^k(x) - f(x)] \xrightarrow{d} \nu(x), \quad \forall x \in X, \qquad (60)$$

where $\xrightarrow{d}$ denotes convergence in distribution and $\nu(x)$ is a centered normal random variable with variance $\sigma^2(x) = \mathrm{Var}[g(x,\omega)]$. Since the hypotheses of [17, Theorem 5.8] are then satisfied at $x^*$, the first order asymptotics of the SAA estimator $y^k$ can be inferred from the second order Taylor series expansion of $f$ at $x^*$ and the Delta theorem, and we find

$$q(k)^{1/2}\,[y^k - x^*] \xrightarrow{d} -E(x^*)\,\tilde{\nabla}^2 f(x^*)^{-1} E(x^*)'\,\nabla\nu(x^*), \qquad (61)$$

where $\tilde{\nabla}^2 f(x) := E(x^*)'\,\nabla^2 f(x)\,E(x^*)$.

We see from (57) and (61) that the convergence of the sequence $\{x^k\}$ is then asymptotically analogous to that of a discrete-time random dynamical system characterized by (i) the affine mapping sequence $\{A^k\}$, which converges almost surely towards the asymptotic convergence rate $A^\infty = E(x^*)M(f,x^*)E(x^*)'$ of the (typically linearly convergent) algorithm $\mathcal{M}$, and (ii) a random noise process with variance vanishing sublinearly like $O(q(k)^{-1})$, thus hindering the whole optimization process and dictating its actual asymptotic performance.
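The effect of the noise process can be illustrated on a scalar caricature of this dynamical system. The sketch below is an assumption-laden toy, not the paper's model: the rate is a fixed scalar `a`, the recursion is taken as $e^{k+1} = a\,e^k + (1-a)\,(y^k - x^*)$, and $y^k - x^*$ is simulated as the running mean of $q(k) = k+1$ standard normal draws.

```python
import random

def simulate_error(a=0.5, iters=2000, noisy=True, seed=1):
    """Iterate e <- a*e + (1 - a)*noise_k from e = 1 and return the final error.

    noise_k stands for y^k - x*: the mean of k + 1 standard normal samples,
    so its variance decays like 1/q(k) with q(k) = k + 1 (assumed toy model).
    """
    rng = random.Random(seed)
    err, total = 1.0, 0.0
    for k in range(iters):
        total += rng.gauss(0.0, 1.0)
        noise = total / (k + 1) if noisy else 0.0
        err = a * err + (1.0 - a) * noise
    return err

if __name__ == "__main__":
    print("noise-free:", simulate_error(noisy=False))
    print("with SAA noise:", simulate_error(noisy=True))
```

Without noise the error contracts linearly at rate `a`; with noise it stalls at the scale of the estimator error, so the sampling schedule $q(k)$ rather than the rate of $\mathcal{M}$ governs the asymptotic accuracy in this toy setting.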

Remark 6. The impact of variance of the SAA estimator can be lessened using 
variance reduction [BHOIIIZ] or scenario reduction techniques m- Reducing 
the computational charge due to sample averaging is possible for instance by 
controlling the sample generation process [22) . or by synchronizing—possibly 
in parallel—the application of the descent algorithm (I55p with the increasing 
precision of {/^} [IB] . 


Appendix: proofs and auxiliary results 

Proof of Proposition^ Consider any x X and the gradient projection (f, x) with 

a tentative step size a £ (0,1]. We have 

f(x) - f(x(a)) > -Vf(xy(x(a) - x) - i||x(a) - x\\l 

0 

> (iE(a) — x)'M{x{a) — x) 

with M := [aT{f,x)]~^ — and by ((Sj the condition ([TJi is satisfied for a = 1. 

Suppose now that $T(f,\cdot)$ is continuous and $f$ is twice continuously differentiable in a neighborhood $X^*$ of $x^*$. Taylor's theorem yields

$$f(x) - f(y) = -\nabla f(x)'(y - x) - \tfrac{1}{2}\|y - x\|^2_{\nabla^2 f(x)} + o(\|y - x\|^2), \quad \forall x, y \in X^*. \qquad (64)$$

Consider the sequence converging to x* and the sequence such that = 

^(^’^) (/, with step size 1 for all k. Since x*) = x* for any step size by 

stationarity of x*, we find that y^ x* by continuity of Thus, for k large enough, 

x^,y^ £ X*^ and it follows from lO and ll64ll that 

fix'") - fiy'") > \\y'" - X>"\\‘^ + o{\\y>" - (65) 

with Q := T(f,x'")~^ — fix'"). By JHJ and continuity arguments, 10 is satisfied at x^ 

for large k if a = 1, i.e. x^'^^ = y^. Hence d(/, x^) —)■ 1. □ 

Proof of Proposition By strict complementarity at x* we know that CD is satisfied with 
coefficients $\{\alpha_j\}_{j \in A(x^*)}$ positive. For any $\delta > 0$, denote by $X^*(\delta) := \{x \in X \mid \|x - x^*\| < \delta\}$ a neighborhood of $x^*$ in $X$. We first show that one can find a $\delta > 0$ such that $A(x) \subset A(x^*)$ for any $x \in X^*(\delta)$. Otherwise there would be $j \in \{1, \dots, p\} \setminus A(x^*)$ and a sequence $\{y^k\}$ in $X$ converging towards $x^*$ such that $c_j(y^k) = 0$ for all $k$. By continuity of $c_j$, we would find $c_j(x^*) = 0$ and thus $j \in A(x^*)$, which is a contradiction. Since the proposition becomes trivial if $A(x^*) = \emptyset$, we suppose in the rest of the proof that $A(x^*) \ne \emptyset$, and thus $\|\nabla f(x^*)\| \ne 0$ by strict complementarity at $x^*$.


(62) 

(63) 






Consider a point $x \in X$ where $A(x) = B$ with $B \subset A(x^*)$ and $B \ne A(x^*)$. The affine constraints can be rewritten as $c_j(x) = \nabla c_j(x^*)'(x - x^*)$ for all $x \in X$ and $j \in A(x^*)$. We find

$$\sum_{j \in A(x^*)} \alpha_j c_j(x) = \Big[\sum_{j \in A(x^*)} \alpha_j \nabla c_j(x^*)'\Big](x - x^*) = -\nabla f(x^*)'(x - x^*). \qquad (66)$$

Assume that $\nabla f(x^*)$ is a linear combination of elements of $\{\nabla c_j(x^*)\}_{j \in B}$. We have $c_j(x) = 0$ for $j \in B$ and the expression in (66) is equal to 0. Since $c_j(x) < 0$ for $j \in A(x^*) \setminus B$, we find $\sum_{j \in A(x^*)} \alpha_j c_j(x) < 0$, a contradiction. Hence $\nabla f(x^*)$ cannot be expressed as a linear combination of elements of $\{\nabla c_j(x^*)\}_{j \in B}$ and there exists a $\Delta > 0$ independent of $x$ such that

^{^j}jeB- (67) 

For 5 > 0, consider the function 9(5) := max (5, maxa.^x*( 5 ) (/? ^) ~ ^11) with any 

bounded scaling strategy T and step-size policy in [a, 1] (a > 0). Since = 

X*, we find by Lipschitz continuity of V/ and other continuity arguments that 0((5) 4^ 0 
whenever (5 4- 0. It follows that for any p > 0, one can find a (5 > 0 such that x £ X*{5) 
yields both A{x) C A{x*) and ^(<5) < p. By Lipschitz continuity of V/, we also have 
||V/(a:) — V/(x*)|| < l\\x — x*|| < Ip for any x £ X, where I denotes the Lipschitz constant. 
Suppose now that A{x) = B for some x £ X*[5) and set y = x) with step size 

a £ [a, 1]. From (Ell and ET}, we infer the existence of nonnegative coefficients {^j}j£B 
satisfying 

V/(x) + [aT{f,x)]-^{y - x) = - EjsS Wj (y). (68) 

Then, 


I v/(x*) + Ej6B &jVc{x*)\\ ||[V/(x*) - V/(x)] - [aT(f, x)]-i(j/ - x)|| 

^ ^]p) 

which contradicts isg if, initially, p < zi/[Z-h l/(aA)]. Hence A(y) 7 ^ B, which proves the 
first statement considering that the number of constraints p is finite. The second statement 
is then immediate. □ 


Proof of Lemma Q) We already know from Proposition ^ that the step sizes chosen by 0 
are equal to 1. The rest of the proof—herein provided for completeness and comparison— 
follows the lines of that of [g Theorem 5] with the difference that we reason with the 
norm IHIv. We have 


$$\begin{aligned}
\|x^{k+1} - x^*\|_V^2 &= \|x^k - x^*\|_V^2 + 2(x^k - x^*)'V(x^{k+1} - x^k) + \|x^{k+1} - x^k\|_V^2 \\
&= \|x^k - x^*\|_V^2 + 2(x^{k+1} - x^*)'V(x^{k+1} - x^k) - \|x^{k+1} - x^k\|_V^2,
\end{aligned}$$
vfe -*11^ J_?_yy , _ lu^'+i , 


0 

< ||x'‘ - X 


Q 

< llx*' — X 


n-K^k 


V-/(x^)'(xE-x-r)-||x'=+i-x- 


*l|2 

V 


2 [y^kf{x>^nxl, - xjj + f{x>=) - /(x'^+l)] 


9 „ te 

< — X 


*\\2 

V 


^^^kf{x>^y{x*^,-x^^,) 2[/(xfe)-/(xfe+l)] 


nn.k 


which yields, by expectation in 4>^ and rearrangement of the terms. 


E [ip'(x'=+l)| s'"] 9 •P(x^) - 7rV/(x'“)'(x'' - x*). 


(69) 
Since $\nabla f(x^k)'(x^k - x^*) \ge f(x^k) - f(x^*)$ by convexity of $f$, we find by computing successive


conditional expectations, 

E -7lJ2Yo'^[f{x^) - fix*)] (70) 

< ^(x°) - n{k + 1)E [/(x'^+l) - fix*)] (71) 

which shows SB. 

When Assumption [3] holds, we proceed as in CJ and find, 

Vfix'^Yix* - x'^) < fix*) - fix^) - lux'" - x*\\l < -u\\x>^ - x*fy. (72) 

Substituting the two inequalities (72) into (69) with relative weights $\mu = 2u(u + n\pi)^{-1} \in (0, 1]$ and $1 - \mu$ yields (42). □

Lemma 2. Let $H = (H_{ij})$ be a symmetric block matrix such that $H = D - L - L'$, where $D = \mathrm{diag}(D_1, \dots, D_p)$ is block diagonal and $L$ is strictly lower triangular. If $T = \mathrm{diag}(T_1, \dots, T_p)$ is a symmetric, positive definite, block diagonal matrix and $G_i := I_p - \mathrm{diag}(0, \dots, I_i, 0, \dots, 0)\,TH$ for $i = 1, \dots, p$, then $G_p G_{p-1} \cdots G_1 = (T^{-1} - L)^{-1}(T^{-1} - D + L')$.

Proof. Since the result is trivial for $p = 1$ we suppose that $p \ge 2$. We proceed by induction on $p$. Let $M_1 := G_1$. For $2 \le i \le p$, define $M_i := G_i G_{i-1} \cdots G_1$ and decompose $T$, $D$, $L$ and the $p \times p$ identity matrix $I_p$ into


Ti 0 0\ Di 0 0\ Li 0 0\ h 0 0 

r=( 0 Ti 0 ,11= 0 Di o 0 0 ,4= 0/^ 0 

0 0 Tj \ 0 0 Dj \ Li Xi Lj \o 0 li 

We can write 

( h 0 0\ 

Gi = { Tili li — TiDi TiX'^ I , 2 < 2 < p. 

Vo 0 Li J 

For some 2 > 2, notice that Ti is nonsingular, as well as T~'^ — Li, and suppose that 

1. . = {ir~"-r)~HT-^-Di + L'i) if-^-Li)-Hk,li)' 

0 Ii_, 

' Ziif-^ -Di + L'i) Zil'i ZiJfi 

0 li 0 

0 0 Li 

where Zi := (T”^ — Li)~^. By block matrix inversion of T~yi — £i+i, we have 


T-^\-Li+, = ( f ) , 7i-\-D,+i+L'+i = ( - ^ 


Ti -Di + L'i 


It follows from II74II . II76II and Mi = GiMi—, that 

/ ZiiT-'^ - Di + L'i) Zil'i Zil'i 

Mi= TiliZilf-^ - Di + L'i) TiliZil'i + Li-TiDi TiliZiYi + TiX'i ), 

V 0 0 7 , 

12 ) f (f-y\ - Li+i)-HTrY\ - Di+i + L'i^i) (f-y\-Li+,)-Hk+idi+iy 

0 Li 


(73) 

(74) 

(75) 

(76) 

(77) 

(78) 

(79) 


where we have used (/i+i,A^). Since (1 7511 holds for 2 = 1, we find by induction 
$G_p \cdots G_1 = M_p = (T^{-1} - L)^{-1}(T^{-1} - D + L')$. □
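The identity of Lemma 2 can be checked numerically in the simplest setting. The sketch below assumes scalar blocks (each $D_i$ and $T_i$ is a number), so that $G_i$ differs from the identity only in row $i$; the helper names are hypothetical and plain Python lists stand in for matrices.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def sweep_product(H, t):
    """G_p ... G_1 with G_i = I - e_i e_i' T H (scalar blocks assumed)."""
    p = len(H)
    M = [[float(i == j) for j in range(p)] for i in range(p)]
    for i in range(p):
        G = [[float(r == c) for c in range(p)] for r in range(p)]
        G[i] = [G[i][c] - t[i] * H[i][c] for c in range(p)]
        M = matmul(G, M)  # left-multiply: M_i = G_i M_{i-1}
    return M

def closed_form(H, t):
    """(T^-1 - L)^-1 (T^-1 - D + L') where H = D - L - L'."""
    p = len(H)
    rhs = [[0.0] * p for _ in range(p)]
    for i in range(p):
        rhs[i][i] = 1.0 / t[i] - H[i][i]
        for j in range(i + 1, p):
            rhs[i][j] = -H[j][i]  # entry of L' above the diagonal
    Z = [[0.0] * p for _ in range(p)]
    for c in range(p):
        for i in range(p):
            # Forward substitution: T^-1 - L has 1/t_i on the diagonal
            # and H_ij below it.
            acc = rhs[i][c] - sum(H[i][j] * Z[j][c] for j in range(i))
            Z[i][c] = t[i] * acc
    return Z

def max_gap(H, t):
    """Largest entrywise difference between the two sides of the identity."""
    A, B = sweep_product(H, t), closed_form(H, t)
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

if __name__ == "__main__":
    H = [[4.0, 1.0, 0.5], [1.0, 3.0, 1.0], [0.5, 1.0, 2.0]]
    print(max_gap(H, [0.2, 0.3, 0.4]))
```

In this scalar case the product of the $G_i$ is the iteration matrix of one cyclic coordinate sweep, and the closed form is its Gauss-Seidel-type expression.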